Introducing SETA: Open Source RL Environments for Terminal Agents
Explore SETA, a toolkit with 400 RL tasks tailored for terminal agents.
Salesforce AI's xRouter uses reinforcement learning and a cost-aware reward to route queries among 20+ LLMs, approaching top-model accuracy while dramatically cutting costs.
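To make the cost-aware idea concrete, a routing reward might pay for a correct answer and charge for the dollar cost of the model that produced it. The sketch below illustrates that shape; the weighting and the success signal are illustrative assumptions, not xRouter's published formulation.

```python
# Hypothetical sketch of a cost-aware routing reward (not xRouter's
# actual reward): pay for correctness, charge for model cost.
def routing_reward(correct: bool, cost_usd: float, cost_weight: float = 10.0) -> float:
    return (1.0 if correct else 0.0) - cost_weight * cost_usd

# Example: a cheap model that answers correctly beats an expensive one.
print(routing_reward(correct=True, cost_usd=0.002))  # 0.98
print(routing_reward(correct=True, cost_usd=0.05))   # 0.5
```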
Grok 4.1 brings two modes that top LMArena leaderboards, boosting perceived helpfulness and cutting hallucinations for informational queries while exposing alignment tradeoffs in deception and sycophancy.
Meta's DreamGym synthesizes environment interactions as text using a reasoning experience model and grounded replay memory, cutting real rollouts and boosting RL performance across web benchmarks.
A compact neural agent learns to plan, store, and compose symbolic tools end-to-end with reinforcement learning, demonstrating emergent multi-step reasoning on synthetic arithmetic tasks.
SkyRL tx v0.1.0 brings a Tinker-compatible training and inference engine to local GPU clusters, adding end-to-end RL support, faster sampling, and Postgres support.
DeepAgent merges thinking, tool search, tool calls, and memory compression into a single reasoning stream, enabling dynamic tool discovery across tens of thousands of APIs and improved long-horizon performance.
Microsoft open-sourced Agent Lightning to convert agent execution traces into RL transitions, enabling training of LLM policies with minimal integration and support for standard RL trainers.
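To illustrate the trace-to-transition idea, each LLM call in a trace can become a (state, action, reward) record, with the episode's final reward assigned to the last step. The record layout and credit assignment below are assumptions for illustration, not Agent Lightning's actual API.

```python
# Hypothetical sketch of converting an agent execution trace into RL
# transitions; field names and credit assignment are assumptions.
from dataclasses import dataclass

@dataclass
class Transition:
    state: str     # the prompt/context the LLM saw at this step
    action: str    # the text the LLM produced
    reward: float  # credit assigned to this step

def trace_to_transitions(trace: list[dict], final_reward: float) -> list[Transition]:
    # Simplest credit assignment: the whole episode reward lands on the last call.
    return [
        Transition(step["prompt"], step["response"],
                   final_reward if i == len(trace) - 1 else 0.0)
        for i, step in enumerate(trace)
    ]
```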
Learn how to create a custom trading environment and train multiple RL agents with Stable-Baselines3, then evaluate and visualize their performance to find the best strategy.
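In that spirit, here is a minimal runnable sketch of the pattern: a toy Gymnasium trading environment over synthetic random-walk prices, trained and compared across several Stable-Baselines3 algorithms. The observation layout, reward, and hyperparameters are illustrative assumptions, not the tutorial's exact setup.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import A2C, DQN, PPO
from stable_baselines3.common.evaluation import evaluate_policy

class TradingEnv(gym.Env):
    """Toy single-asset environment: stay flat (0) or hold one unit long (1)."""

    def __init__(self, n_steps: int = 200):
        super().__init__()
        self.n_steps = n_steps
        self.action_space = spaces.Discrete(2)
        # Observation: [current price, current position]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32
        )

    def _obs(self):
        return np.array([self.prices[self.t], self.position], dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Synthetic random-walk prices (illustrative data, not real markets).
        self.prices = 100.0 + np.cumsum(self.np_random.normal(0.0, 1.0, self.n_steps))
        self.t = 0
        self.position = 0
        return self._obs(), {}

    def step(self, action):
        self.position = int(action)
        # Reward: the price move captured while long, zero while flat.
        delta = self.prices[self.t + 1] - self.prices[self.t]
        reward = float(delta) if self.position == 1 else 0.0
        self.t += 1
        terminated = self.t >= self.n_steps - 1
        return self._obs(), reward, terminated, False, {}

# Train several agents on the same environment and compare mean episode reward.
for algo in (PPO, A2C, DQN):
    model = algo("MlpPolicy", TradingEnv(), verbose=0)
    model.learn(total_timesteps=5_000)
    mean_reward, _ = evaluate_policy(model, TradingEnv(), n_eval_episodes=5)
    print(f"{algo.__name__}: mean reward {mean_reward:.2f}")
```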
UltraCUA introduces a hybrid action model that lets agents mix GUI primitives with programmatic tool calls, improving success rates and reducing steps across desktop automation benchmarks.
W4S trains a 7B meta-agent to program Python workflows that call stronger LLM executors, using offline RL to iteratively generate, execute, and refine solutions. The approach yields consistent gains across 11 benchmarks and achieves Pass@1 of 95.4 on HumanEval with GPT-4o-mini.
QeRL uses NVFP4 weight quantization plus LoRA and AQN to boost rollout throughput and exploration, allowing a 32B policy to be trained on a single H100 with competitive accuracy.
RA3 formalizes mid-training as pruning plus horizon shortening and uses temporal action abstractions to accelerate RL post-training, boosting code generation benchmarks.
AgentFlow introduces a modular Planner–Executor–Verifier–Generator architecture and Flow-GRPO, a token-level on-policy method that trains only the Planner, reporting substantial gains across ten benchmarks and providing an open-source MIT-licensed implementation.
MoonshotAI released checkpoint-engine, a middleware that updates model weights across thousands of GPUs in about 20 seconds, enabling fast RL and large-scale LLM serving with minimal downtime.
MIT shows that on-policy reinforcement learning preserves prior capabilities better than supervised fine-tuning by minimizing the forward KL divergence between the base and fine-tuned models.
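For reference, the forward KL divergence in that claim takes the expectation under the base model's own distribution; the notation below is ours, spelled out for readers, not the paper's exact statement.

```latex
% Forward KL between base and fine-tuned policies, with the
% expectation under the base model's own outputs:
D_{\mathrm{KL}}\!\left(\pi_{\mathrm{base}} \,\|\, \pi_{\mathrm{ft}}\right)
  = \mathbb{E}_{y \sim \pi_{\mathrm{base}}(\cdot \mid x)}
    \left[ \log \frac{\pi_{\mathrm{base}}(y \mid x)}{\pi_{\mathrm{ft}}(y \mid x)} \right]
```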
Biomni-R0 applies end-to-end reinforcement learning and expert rewards to train 8B and 32B biomedical agents that outperform much larger general models on multi-step reasoning tasks.
Alibaba's Qwen team introduced GUI-Owl and Mobile-Agent-v3, a unified multimodal agent and multi-agent framework that automates GUI tasks across mobile and desktop with state-of-the-art benchmark performance.
Zhipu AI's ComputerRL combines programmatic APIs with GUI actions and a scalable RL infrastructure to build more capable desktop agents. Experimental results show strong gains on the OSWorld benchmark, driven by the API-GUI paradigm and the Entropulse training method.
Midcentury pigeon experiments by B.F. Skinner inspired the associative learning ideas that underpin modern reinforcement learning, reshaping both AI and how scientists view animal intelligence.
ToolTrain teaches LLMs to use simple repository tools and combines SFT with tool-integrated RL to improve multi-hop issue localization, delivering state-of-the-art results on real-world benchmarks.
Nebius AI and Humanoid adapted DAPO-based reinforcement learning to train an open-weight Qwen2.5 agent for long-horizon software engineering, reaching 39% Pass@1 on SWE-bench Verified without teacher supervision.
ProRLv2 scales RL training to 3,000 steps and combines regularization and exploration techniques to expand reasoning capabilities in compact LLMs, showing strong benchmark gains across math, coding, logic and STEM tasks.
AI pricing tools can produce tacit-collusion-like outcomes, challenging traditional antitrust frameworks and prompting new enforcement, legislation, and transparency measures.
Graph-R1 combines hypergraph knowledge, agentic multi-turn retrieval, and end-to-end RL to deliver state-of-the-art QA accuracy and efficient generation.
ByteDance introduces Seed-Prover, a novel lemma-centric system that achieves breakthrough results in automated mathematical theorem proving, solving 5 out of 6 IMO 2025 problems and excelling across multiple benchmarks.
MiroMind-M1 introduces an open-source pipeline for advanced mathematical reasoning, leveraging a novel multi-stage reinforcement learning approach to achieve state-of-the-art performance and transparency.
Rubrics as Rewards (RaR) introduces a reinforcement learning approach that uses structured rubrics as reward signals, improving language model training in complex domains like medicine and science.
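As a rough sketch of the rubric-as-reward idea, a response can be scored as a weighted average of per-criterion judgments. The rubric items, weights, and judge interface below are illustrative assumptions, not RaR's published implementation.

```python
# Hypothetical rubric-scored reward: a weighted average of per-criterion
# judge scores in [0, 1]. Criteria, weights, and the judge are assumptions.
def rubric_reward(response: str, rubric: list[tuple[str, float]], judge) -> float:
    total = sum(weight for _, weight in rubric)
    return sum(weight * judge(criterion, response) for criterion, weight in rubric) / total

# Example rubric for a medical answer (illustrative only).
rubric = [
    ("States the correct diagnosis", 2.0),
    ("Cites supporting evidence", 1.0),
    ("Avoids unsafe recommendations", 2.0),
]
# `judge` would typically be an LLM grader returning a score in [0, 1].
```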
Alibaba introduces Qwen3-MT, a next-generation multilingual machine translation model featuring cutting-edge architecture and reinforcement learning for high-quality, cost-efficient translations across 92+ languages.
Master-RM is a new reward model designed to fix vulnerabilities in LLM-based evaluators by reducing false positives caused by superficial cues, ensuring more reliable reinforcement learning outcomes.
MemAgent introduces a reinforcement learning-based memory agent that allows large language models to process ultra-long documents efficiently, maintaining high accuracy with linear computational costs.
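The linear-cost trick is to read the document chunk by chunk while overwriting a fixed-size memory; the sketch below illustrates that loop, with `update_memory` standing in for the RL-trained memory policy (an assumption, not MemAgent's actual interface).

```python
# Hypothetical sketch of fixed-memory chunked reading: cost grows linearly
# with document length because the memory never grows past a fixed budget.
def read_long_document(doc: str, update_memory, chunk_size: int = 4096) -> str:
    memory = ""  # bounded scratchpad carried across chunks
    for start in range(0, len(doc), chunk_size):
        memory = update_memory(memory, doc[start:start + chunk_size])
    return memory  # final memory is what answers the query
```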
GLM-4.1V-Thinking is a cutting-edge vision-language model that pushes the boundaries of multimodal reasoning, setting new standards across various challenging AI tasks.
Mirage introduces a new method for vision-language models to integrate visual reasoning without generating images, significantly enhancing their ability to solve spatial and multimodal tasks.
Apple and the University of Hong Kong introduce DiffuCoder, a 7-billion parameter diffusion model designed specifically for code generation, demonstrating promising results and novel training methods.
MMSearch-R1 introduces a reinforcement learning framework that enables large multimodal models to perform efficient, on-demand searches by learning when and how to retrieve relevant information, significantly improving accuracy and reducing search overhead.
Embodied AI agents leverage world models to perceive and act in real or virtual environments, enhancing their autonomy and human-like interaction across various industries.
Salesforce AI releases GTA1, a powerful GUI agent that outperforms OpenAI's CUA by leveraging innovative test-time scaling and reinforcement learning techniques to improve task success and action grounding.
Meta and NYU developed a semi-online reinforcement learning method that balances offline and online training to enhance large language model alignment, boosting performance in both instruction-based and mathematical tasks.
AbstRaL uses reinforcement learning to teach LLMs abstract reasoning, significantly improving their robustness and accuracy on varied GSM8K math problems compared to traditional methods.
ASTRO, a novel post-training method, significantly enhances Llama 3's reasoning abilities by teaching search-guided chain-of-thought and self-correction, achieving up to 20% benchmark gains.
Together AI has launched DeepSWE, an open-source, reinforcement learning-trained coding agent based on Qwen3-32B, achieving top scores on the SWE-bench benchmark and setting new standards for autonomous software engineering AI.
ReasonFlux-PRM is a new trajectory-aware reward model that evaluates both reasoning steps and final answers in large language models, significantly improving their reasoning capabilities and training outcomes.
OMEGA is a novel benchmark designed to probe the reasoning limits of large language models in mathematics, focusing on exploratory, compositional, and transformational generalization.
LongWriter-Zero introduces a novel reinforcement learning framework that enables ultra-long text generation without synthetic data, achieving state-of-the-art results on multiple benchmarks.
Tencent introduces Hunyuan-A13B, a highly efficient open-source MoE language model with dual-mode reasoning and support for ultra-long 256K context lengths, achieving state-of-the-art benchmark results.
Unbabel introduces TOWER+, a unified multilingual large language model that excels in both high-fidelity translation and instruction-following, surpassing existing open-weight models in benchmarks.
Polaris-4B and Polaris-7B introduce a novel reinforcement learning recipe that scales reasoning capabilities efficiently, achieving state-of-the-art results on math benchmarks with smaller models.
GURU introduces a multi-domain reinforcement learning dataset and models that significantly improve reasoning abilities of large language models across six diverse domains, outperforming previous open models.
MIT and NUS researchers introduce MEM1, a reinforcement learning framework that enables language agents to efficiently manage memory during complex multi-turn tasks, outperforming larger models in speed and resource use.
ByteDance researchers introduce ProtoReasoning, a new framework leveraging logic-based prototypes to significantly improve reasoning and planning abilities in large language models across various domains.
PoE-World introduces a modular symbolic approach that surpasses traditional reinforcement learning methods in Montezuma’s Revenge with minimal data, enabling efficient planning and strong generalization.
MiniMax AI has unveiled MiniMax-M1, a 456B parameter hybrid model optimized for long-context processing and reinforcement learning, offering significant improvements in scalability and efficiency.
New research finds DeepSeek to be the AI chatbot most willing to engage in explicit sexual conversations, in contrast with stricter models like Claude and GPT-4o.
ReVisual-R1 is an innovative open-source 7B multimodal language model that advances complex reasoning by integrating a three-stage training pipeline with novel reinforcement learning techniques.
DeepCoder-14B is an open-source AI model designed for efficient and transparent code generation, matching proprietary models in performance while promoting collaboration and accessibility.
Internal Coherence Maximization (ICM) introduces a novel label-free, unsupervised training framework for large language models, achieving performance on par with human-supervised methods and enabling advanced capabilities without human feedback.
Large Language Models often skip parts of complex instructions due to attention limits and token constraints. This article explores causes and practical tips to improve instruction adherence.
CURE is a novel self-supervised reinforcement learning framework that enables large language models to co-evolve code and unit test generation, significantly enhancing performance and efficiency without requiring ground-truth code.
Meta has introduced LlamaRL, an innovative scalable and asynchronous reinforcement learning framework built in PyTorch that dramatically speeds up training of large language models while optimizing resource use.
NVIDIA introduces ProRL, a novel reinforcement learning method that extends training duration to unlock new reasoning capabilities in AI models, achieving superior performance across multiple reasoning benchmarks.
Shanghai AI Lab researchers propose entropy-based scaling laws and novel techniques to overcome exploration collapse in reinforcement learning for reasoning-centric large language models, achieving significant performance improvements.
MiMo-VL-7B is a powerful vision-language model developed by Xiaomi researchers, offering state-of-the-art performance in visual understanding and multimodal reasoning through advanced training techniques.
Researchers introduce Regularized Policy Gradient (RPG), a novel framework leveraging KL divergence in off-policy reinforcement learning to significantly improve reasoning and training stability in large language models.
Enigmata introduces a comprehensive toolkit and training strategies that significantly improve large language models' abilities in puzzle reasoning using reinforcement learning with verifiable rewards.
Apple and Duke researchers introduce Interleaved Reasoning, a reinforcement learning method that allows LLMs to produce intermediate answers, significantly boosting response speed and accuracy in complex tasks.
Qwen2.5-Math models improve math reasoning significantly even when trained with incorrect or random reward signals, highlighting unique reinforcement learning dynamics not seen in other models.
MMaDA is a novel unified multimodal diffusion model that excels in textual reasoning, visual understanding, and image generation, outperforming existing systems across multiple benchmarks.
Microsoft's Phi-4-reasoning demonstrates that high-quality, curated data can enable smaller AI models to perform advanced reasoning tasks as effectively as much larger models, challenging the notion that bigger models are always better.
QwenLong-L1 introduces a structured reinforcement learning approach enabling large language models to excel at long-context reasoning tasks, achieving state-of-the-art results on multiple benchmarks.
NVIDIA introduces Llama Nemotron Nano 4B, a compact open-source AI model optimized for edge deployment that outperforms larger models in scientific reasoning and programming tasks.
GRIT introduces a groundbreaking method for teaching multimodal large language models to jointly reason with images and text, significantly improving visual grounding and reasoning accuracy while requiring minimal training data.
Researchers have developed a reinforcement learning framework that enables LLMs to optimize assembly code beyond traditional compilers, achieving a 1.47× speedup and 96% correctness on thousands of real-world programs.
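A natural reward for this setup gates speedup on correctness: zero unless the optimized assembly passes all tests, otherwise the measured speedup over the compiler baseline. The sketch below illustrates that shape; the exact gating and baseline choice are assumptions about the paper's reward.

```python
# Hypothetical correctness-gated speedup reward for assembly optimization;
# the gating rule and baseline choice are illustrative assumptions.
def asm_reward(passes_all_tests: bool, baseline_secs: float, candidate_secs: float) -> float:
    if not passes_all_tests:
        return 0.0  # incorrect code earns nothing, however fast
    return baseline_secs / candidate_secs  # e.g. 1.47 means 1.47x faster

print(asm_reward(True, baseline_secs=1.47, candidate_secs=1.0))  # 1.47
```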
Researchers from the National University of Singapore developed Thinkless, a framework that dynamically adjusts reasoning depth in language models, cutting unnecessary computation by up to 90% while maintaining accuracy.
Researchers improve large language models' reasoning by explicitly aligning core abilities like deduction, induction, and abduction, surpassing traditional instruction-tuned models in accuracy and reliability.
RXTX is a novel machine learning-based algorithm that achieves faster and more efficient structured matrix multiplication, outperforming existing methods including recursive Strassen techniques.
NVIDIA introduces Cosmos-Reason1, a new suite of AI models designed to enhance physical common sense and embodied reasoning using multimodal learning and innovative ontologies, improving AI interaction in real-world environments.
Anthropic’s research exposes critical gaps in how AI models explain their reasoning via chain-of-thought prompts, showing frequent omissions of key influences behind decisions.
DanceGRPO introduces a unified reinforcement learning framework that enhances visual generation across multiple paradigms and tasks, significantly improving visual quality and alignment with human preferences.
NVIDIA's Joey Conway discusses groundbreaking open-source AI models Llama Nemotron Ultra and Parakeet, highlighting innovations in reasoning control, data curation, and rapid speech recognition.
New research shows that including toxic data in LLM pretraining improves the model's ability to be detoxified and controlled, leading to safer and more robust language models.
Nemotron-Tool-N1 introduces a novel reinforcement learning approach enabling large language models to effectively use external tools with minimal supervision, outperforming existing fine-tuned models on key benchmarks.
RLV introduces a unified framework that integrates verification into value-free reinforcement learning for language models, significantly improving reasoning accuracy and computational efficiency on mathematical reasoning benchmarks.
Alibaba’s ZeroSearch framework leverages reinforcement learning and simulated document generation to train language models for retrieval without relying on costly real-time search APIs, achieving performance comparable to or better than Google Search.
Microsoft Research has developed ARTIST, a reinforcement learning framework that empowers LLMs to use external tools dynamically, significantly improving performance on complex reasoning tasks.
Salesforce’s xGen-small offers a compact AI model delivering efficient long-context understanding with reduced costs and strong privacy, transforming enterprise AI workflows.
DeepSeek-Prover-V2 bridges informal intuition and formal math proofs, achieving strong benchmark results and offering open-source access to revolutionize AI-driven mathematical reasoning.
OpenAI launches Reinforcement Fine-Tuning on the o4-mini model, enabling developers to customize AI reasoning with precision using reinforcement learning techniques.
WebThinker is a new AI agent that empowers large reasoning models to autonomously search the web and generate detailed scientific reports, significantly improving performance on complex reasoning benchmarks.
NVIDIA, CMU, and Boston University researchers introduce Nemotron-CrossThink, a novel framework that expands reinforcement learning for large language models beyond math to multiple reasoning domains with improved accuracy and efficiency.
Researchers at UC Berkeley and UCSF have developed Adaptive Parallel Reasoning, a novel method that allows large language models to dynamically distribute inference tasks across parallel threads, enhancing reasoning performance without exceeding context window limits.
Researchers introduce StarPO-S and RAGEN frameworks, significantly improving stability and reasoning capabilities in training autonomous large language model agents for multi-turn interactive tasks.
Xiaomi's MiMo-7B is a compact language model that surpasses larger models in math and code reasoning through advanced pre-training and reinforcement learning strategies.
DeepSeek-AI released DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving using subgoal decomposition and reinforcement learning, achieving state-of-the-art results on multiple formal reasoning benchmarks.
Microsoft launched the Phi-4-Reasoning family, a set of 14B parameter open-weight models optimized for complex reasoning tasks. These models demonstrate competitive performance on math, planning, and coding challenges with transparent training and open access.
OpenPipe’s ART·E uses reinforcement learning to deliver faster, cheaper, and more accurate email question-answering, outperforming OpenAI’s o3 agent in key metrics.
USC researchers introduce Tina, a family of compact reasoning models that leverage LoRA and reinforcement learning to deliver strong multi-step reasoning performance at a fraction of typical training costs.
Skywork AI introduces R1V2, a cutting-edge multimodal reasoning model that blends hybrid reinforcement learning techniques to improve specialized reasoning and generalization, outperforming many open-source and proprietary models.